DataFrame and Data Cleaning
In Python, a DataFrame is a 2-dimensional, labeled data structure provided by the pandas library. It is one of the most commonly used structures for data analysis and data science.
A DataFrame can be created from a dictionary, where each key becomes a column name:
import pandas as pd
data = {
"Name": ["Rocky", "John", "Anna"],
"Age": [21, 22, 23],
"Marks": [85, 90, 88]
}
df = pd.DataFrame(data)
print(df)
A DataFrame can also be created from a list of rows, with the column names passed separately:
data = [
["Sheetal", 21, 85],
["Shyam", 22, 90],
["Meena", 23, 88]
]
df = pd.DataFrame(data, columns=["Name", "Age", "Marks"])
print(df)
# Access a column
print(df["Name"])
# Access a specific row and column
print(df.loc[0, "Marks"])
df.head() # first 5 rows
df.tail() # last 5 rows
df.shape # (rows, columns)
df.columns # column names
df.info() # summary
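The difference between label-based and position-based access is easy to miss when the row labels are the default 0, 1, 2. The sketch below is only an illustration: it assumes the names are used as row labels (the `index` argument here is a hypothetical choice, not part of the example above) so that `loc` and `iloc` look visibly different.

```python
import pandas as pd

# Hypothetical setup: names are used as row labels instead of 0, 1, 2.
df = pd.DataFrame(
    {"Age": [21, 22, 23], "Marks": [85, 90, 88]},
    index=["Sheetal", "Shyam", "Meena"],
)

by_label = df.loc["Shyam", "Marks"]  # row labeled "Shyam", column "Marks"
by_position = df.iloc[1, 1]          # second row, second column

print(by_label, by_position)
print(df.shape)
```

Both lookups reach the same cell; `loc` uses the labels and `iloc` uses integer positions.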
We can add a new column to an existing DataFrame:
df["Grade"] = ["A", "A+", "A"]
Modify Data
df["Marks"] = df["Marks"] + 5
Delete Column
df.drop("Age", axis=1, inplace=True)
Filtering Data
high_marks = df[df["Marks"] > 85]
print(high_marks)
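Filters can also combine several conditions. The following is a minimal sketch using a small made-up table; the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Sheetal", "Shyam", "Meena"],
    "Age": [21, 22, 23],
    "Marks": [85, 90, 88],
})

# Each condition needs its own parentheses; use & for "and", | for "or".
older_high_scorers = df[(df["Marks"] > 85) & (df["Age"] > 22)]
print(older_high_scorers)

# .loc can filter rows and select a column in one step.
names_only = df.loc[df["Marks"] > 85, "Name"]
print(names_only.tolist())
```

Python's `and` / `or` keywords do not work element-wise on columns, which is why the `&` and `|` operators are used here.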
Why Use a Python DataFrame?
df = pd.DataFrame({
"Product": ["Laptop", "Mobile", "Tablet"],
"Price": [80000, 30000, 20000]
})
print(df[df["Price"] > 25000])
Data Cleaning in a DataFrame means detecting and correcting errors, missing values, duplicates, or inconsistent data so that the dataset becomes accurate and ready for analysis.
Check Missing Values
import pandas as pd
df = pd.read_csv("data.csv")
print(df.isnull())
print(df.isnull().sum())
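Since `data.csv` may not be at hand, here is a small self-contained sketch of the same check on made-up data. It also shows one way to list only the rows that contain at least one missing value.

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing name and one missing age.
df = pd.DataFrame({
    "Name": ["Ram", "Hari", None],
    "Age": [20, np.nan, 21],
})

# Number of missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Rows where any column is missing.
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing)
```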
Remove Rows with Missing Values
df.dropna() # returns a copy without the rows that contain missing values
Fill Missing Values
df["Age"] = df["Age"].fillna(df["Age"].mean())
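Dropping rows and filling values are two different ways to handle missing data, and they give different results. A minimal sketch on made-up data, comparing the two:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ram", "Hari", "Sita"],
    "Age": [20, None, 22],
})

# Option 1: drop every row that has a missing value.
dropped = df.dropna()

# Option 2: keep all rows and fill the missing Age with the column mean.
filled = df.fillna({"Age": df["Age"].mean()})

print(dropped)
print(filled)
```

Neither call changes `df` itself; each returns a new DataFrame. Dropping loses the row for "Hari", while filling keeps it with an estimated age.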
Check Duplicates
df.duplicated()
Remove Duplicates
df = df.drop_duplicates()
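By default, a row counts as a duplicate only if every column matches an earlier row. The sketch below, on made-up data, shows that default and then uses the `subset` and `keep` parameters to treat rows with the same name as duplicates.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ram", "Hari", "Hari", "Hari"],
    "Marks": [80, 90, 90, 75],
})

# True only where the whole row repeats an earlier row.
full_row_dupes = df.duplicated()
print(full_row_dupes.tolist())

# Compare on "Name" only and keep the first occurrence of each name.
by_name = df.drop_duplicates(subset=["Name"], keep="first")
print(by_name)
```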
Check Data Types
print(df.dtypes)
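Checking types often reveals numbers stored as text. One possible fix, sketched on made-up data, is `pd.to_numeric` with `errors="coerce"`, which turns values that cannot be parsed into missing values instead of raising an error.

```python
import pandas as pd

# Made-up column where ages were read in as text.
df = pd.DataFrame({"Age": ["21", "22", "unknown"]})
print(df.dtypes)

# Convert to numbers; "unknown" cannot be parsed and becomes NaN.
df["Age"] = pd.to_numeric(df["Age"], errors="coerce")
print(df.dtypes)
print(df)
```

The coerced values then show up in `df.isnull().sum()` and can be handled like any other missing value.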
Renaming Columns
df.rename(columns={"old_name": "new_name"}, inplace=True)
Removing Unnecessary Columns
df.drop("Address", axis=1, inplace=True)
Example of filtering outliers:
df = df[df["Age"] < 100]
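A fixed cutoff such as 100 works when the valid range is known in advance. When it is not, a common alternative (a different technique from the fixed threshold above) is the interquartile range (IQR) rule. The sketch below uses made-up ages and the conventional 1.5 multiplier, which is an assumption and not a value taken from this article.

```python
import pandas as pd

df = pd.DataFrame({"Age": [20, 21, 22, 23, 24, 250]})

# Quartiles and interquartile range of the column.
q1 = df["Age"].quantile(0.25)
q3 = df["Age"].quantile(0.75)
iqr = q3 - q1

# Conventional bounds: 1.5 * IQR beyond the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only the rows whose Age lies inside the bounds.
cleaned = df[df["Age"].between(lower, upper)]
print(cleaned)
```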
Standardizing Text Data
df["Name"] = df["Name"].str.lower()
df["Name"] = df["Name"].str.strip()
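To see why this matters, the following sketch uses made-up names that differ only in case and surrounding spaces. The two string methods can be chained in one statement.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["  Ram ", "HARI", "hari  "]})

# Strip surrounding whitespace, then convert to lowercase.
df["Name"] = df["Name"].str.strip().str.lower()

print(df["Name"].tolist())
print(df["Name"].nunique())
```

After standardizing, "HARI" and "hari  " are recognized as the same value, so duplicate checks and grouping behave as expected.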
Simple Example of Data Cleaning
import pandas as pd
data = {
"Name": ["Ram", "Hari", "Hari", None],
"Age": [20, None, 22, 21],
"Marks": [80, 90, 90, 85]
}
df = pd.DataFrame(data)
# Fill missing values
df["Age"] = df["Age"].fillna(df["Age"].mean())
# Remove duplicates
df.drop_duplicates(inplace=True)
# Remove rows with missing names
df.dropna(subset=["Name"], inplace=True)
print(df)
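The same steps can be wrapped in a reusable function. This is only a sketch: `clean_students` is a hypothetical name, and returning a cleaned copy instead of modifying the input is a design choice made here for illustration.

```python
import pandas as pd

def clean_students(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a cleaned copy and leave `raw` untouched."""
    cleaned = raw.copy()
    # Fill missing ages with the mean age.
    cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].mean())
    # Remove rows that exactly repeat an earlier row.
    cleaned = cleaned.drop_duplicates()
    # Remove rows that have no name.
    cleaned = cleaned.dropna(subset=["Name"])
    return cleaned.reset_index(drop=True)

raw = pd.DataFrame({
    "Name": ["Ram", "Hari", "Hari", None],
    "Age": [20, None, 22, 21],
    "Marks": [80, 90, 90, 85],
})

result = clean_students(raw)
print(result)
```

Note that both "Hari" rows remain in the result: after the missing age is filled with the mean (21.0), the two rows differ in Age, so they are not exact duplicates. Only the row with the missing name is removed.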